Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] stop relying on string concatenation / splitting for cv() eval results #6761

Merged
merged 14 commits into from
Dec 22, 2024

Conversation

jameslamb
Copy link
Collaborator

Contributes to #6748

There are a few activities where lightgbm (the Python package) needs to inspect the output of one or more evaluation metrics on one or more datasets.

For example:

  • early stopping
  • printing evaluation results
  • recording evaluation results in memory (e.g. in a dictionary) for use after training

For train() and other APIs that end up using it, it tracks those using a list of tuples like this (pseudocode):

[
  ({dataset_name}, {metric_name}, {is_higher_better}, {metric_value}).
  ...
]

cv() does something similar. However, its "metric value" is actually a mean of such values taken over all cross-validation folds. Because multiple values are being aggregated, it appends a 5th item with the standard deviation.

[
  ({dataset_name}, {metric_name}, {is_higher_better}, mean({metric_value}), stddev({metric_value}),
  ...
]

Some code in callbacks.py needs to know, given a list of such tuples, whether they were produced by cross-validation or regular train().

To facilitate that while still somewhat preserving the schema for the tuples, the cv() code:

  • concatenates the first and second elements into 1
  • appends the string literal "cv_agg" to the beginning of the tuple

So e.g. ("valid1", "auc", ...) becomes ("cv_agg", "valid1 auc", ...). That happens here:

def _agg_cv_result(
raw_results: List[List[_LGBM_BoosterEvalMethodResultType]],
) -> List[_LGBM_BoosterEvalMethodResultWithStandardDeviationType]:
"""Aggregate cross-validation results."""
cvmap: Dict[str, List[float]] = OrderedDict()
metric_type: Dict[str, bool] = {}
for one_result in raw_results:
for one_line in one_result:
key = f"{one_line[0]} {one_line[1]}"
metric_type[key] = one_line[3]
cvmap.setdefault(key, [])
cvmap[key].append(one_line[2])
return [("cv_agg", k, float(np.mean(v)), metric_type[k], float(np.std(v))) for k, v in cvmap.items()]

Every place dealing with such tuples then needs to deal with that, including splitting and re-combining that second element. Like this:

# split is needed for "<dataset type> <metric>" case (e.g. "train l1")
eval_name_splitted = env.evaluation_result_list[i][1].split(" ")
if self.first_metric_only and self.first_metric != eval_name_splitted[-1]:

This proposes changes to remove that, so that the cv() and train() tuples follow a similar schema and all the complexity of splitting and re-combining names can be removed.

It also standardizes on the names from #6749 (comment)

Notes for Reviewers

This change should be completely backwards-compatible, including with user-provided custom metric function. The code paths here are well-covered by tests (as I found out from many failed tests while developing this 😅 ).

metric_type[key] = one_line[3]
cvmap.setdefault(key, [])
cvmap[key].append(one_line[2])
return [("cv_agg", k, float(np.mean(v)), metric_type[k], float(np.std(v))) for k, v in cvmap.items()]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, removing this "cv_agg" string literal, is the key change... everything else flows from that.

@jameslamb jameslamb changed the title WIP: [python-package] stop relying on string concatenation / splitting for cv() eval results [python-package] stop relying on string concatenation / splitting for cv() eval results Dec 17, 2024
@jameslamb jameslamb marked this pull request as ready for review December 17, 2024 05:56
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot for clear refactoring!
Just some minor suggestions.

python-package/lightgbm/callback.py Outdated Show resolved Hide resolved
python-package/lightgbm/callback.py Outdated Show resolved Hide resolved
python-package/lightgbm/engine.py Outdated Show resolved Hide resolved
python-package/lightgbm/engine.py Outdated Show resolved Hide resolved
@jameslamb jameslamb requested a review from StrikerRUS December 18, 2024 06:05
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for working on this and for taking my comments into account!

@jameslamb jameslamb merged commit 4ee0bc0 into master Dec 22, 2024
48 checks passed
@jameslamb jameslamb deleted the python/remove-cv-agg branch December 22, 2024 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants